Abstract

Many sorts of structured data are commonly stored in a 
multi-relational format of interrelated tables. Under this re- 
lational model, exploratory data analysis can be done by us- 
ing relational queries. As an example, in the Internet Movie 
Database (IMDb) a query can be used to check whether the 
average rank of action movies is higher than the average 
rank of drama movies. 

We consider the problem of assessing whether the re- 
sults returned by such a query are statistically significant or 
just a random artifact of the structure in the data. Our ap- 
proach is based on randomizing the tables occurring in the 
queries and repeating the original query on the randomized 
tables. It turns out that there is no unique way of random- 
izing in multi-relational data. We propose several random- 
ization techniques, study their properties, and show how to 
find out which queries or hypotheses about our data result in 
statistically significant information. We give results on real 
and generated data and show how the significance of some 
queries varies between different randomizations.

1 Introduction 

The question of evaluating whether certain hypotheses made 
from observed data are significant or not, is one of the old- 
est problems in statistics. Statistical significance reduces an 
observed result (statistic) to a p-value that tells about the 
probability of observing the same result at random when a 
certain null hypothesis is true. If this p-value is sufficiently 
small, we can assume that the null hypothesis is false. The 
technical challenge of defining an exact p-value for a given 
hypothesis is typically resolved by studying the null distri- 
bution of the test analytically; for example, the well known 
chi-squared test is based on statistics that follow a chi-square 
distribution under the null hypothesis. Alternatively, when 
analytical solutions are not possible or hard to state exactly, 
the null distribution can be defined via permutation tests. 

These useful statistical concepts have been used for years 
in experimental fields such as medicine, biology, geology or 
physics, to name a few. Many of these considerations have 
Figure 1: A toy example of a multi-relational database
consisting of three binary relations: movies classified by genre,
GM; directors of movies, MD; and ages of directors, DA.



been extended to the data mining and database communities
as well. In one of the first papers on association rules, Brin
et al. [4] considered measuring the significance of rules
via the chi-squared test, and many other papers followed;
see e.g. [15] for a comprehensive survey. More recently,
randomization tests for assessing data mining results were
introduced for binary data [8] and for real-valued data [12].

Abstracting a bit from the question of how significant pat- 
terns are in the data, we introduce here the statistical testing 
framework to databases and the exploratory task of querying 
the relations of the database. The question of understand- 
ing what we know and what we believe about our dataset 
becomes tricky when the data is highly structured and in- 
terrelated. Structured data is everywhere: examples are the 
Internet Movie Database (IMDb), or the DBLP computer 
science bibliography, and indeed, most of today's informa- 
tion systems are actually relational databases. In IMDb, 
e.g., basic entities are directors, movies, genres, ranks or 
years; in addition, we have relations such as directors direct 
movies, movies are classified by a genre, movies are ranked 
with some quality criteria, and directors are born in a
certain year. Each of these relations is represented in a separate
table which relates to the others through their common attribute
values. A simple toy example is given in Figure 1.

In multi-relational databases, users and applications ac- 
cess the data via queries. E.g., a query can be made to check 
the average age of directors of history movies, or the
average age of directors of romance movies. In the toy
example of Figure 1 the first query returns a value of 60, while
the second query returns a value of 30. Usually, the answer 
returned by the query is assumed as a fact, thus implying 
some conventional wisdom — for this toy example we might 
be tempted to believe that directors of romance movies are 
younger than directors of history movies. But, should we 
really believe that this hypothesis is significant from the 
data? If we knew that all history movies are also classi- 
fied as drama movies, would the value of 60 still have the 
same importance? Or, if we knew that the same director has 
participated in both romance and drama movies? 

We study whether the results returned by queries are sig- 
nificant or just a random artifact due to the structure in 
the data. Our statistical tool is randomization, and the
approach is simple: randomize certain relations occurring in
the queries and repeat the original query on the random
samples. This provides an empirical p-value and, as in basic
statistics, we can reject or accept our hypothesis linked to 
the query. The goal behind this idea is to provide an un- 
derstanding of how the structure of the data affects the sig- 
nificance of the information we derive from our queries. If 
certain structures or patterns remain after simple random- 
izations (e.g., the fact that history movies are also drama 
movies in the toy example), the answers of a query that rely 
on such patterns should be regarded as not significant. 

It turns out that there is no unique way of randomizing 
in multi-relational data, and indeed, it is difficult to give 
a fully satisfactory answer about which randomizations are 
more important than others. We study several randomization 
methods and show the combinatorial properties of the null 
distributions on multiple tables. Our contribution makes a 
first step towards understanding how the significance of a 
query is linked to the structure hidden in the data; random- 
izations are a sound statistical tool to make such a connec- 
tion. We believe this is an important problem of interest to 
both the database and data mining communities. We present 
experimental results on synthetic data, and show the usabil- 
ity of the method for several queries in real datasets. 

2 Problem statement 

Let $A$ be a binary relation $A \subseteq I \times J$ between sets $I$ and $J$.
In the market basket application, for example, $I$ could be a
set of customers and $J$ a set of products. A binary relation
$A \subseteq I \times J$ identifies which customers from $I$ buy which
products from $J$. Notice that every binary relation can be
seen as a binary matrix describing the occurrences between
the row set $I$ and column set $J$; see Figure 2 for examples.

Let $\{A_1, \ldots, A_n\}$ be a set of $n$ binary relations
representing some structured data. This relational model is very
general. It applies, for example, to a movie database system, as
shown in Figure 2. The representation of the same example
as a sequence of bipartite graphs is depicted in Figure 3.

The basic operator to combine relations is the natural 
Figure 2: The binary table representation of the toy database
in Figure 1: (a) GM; (b) MD; and (c) DA.
Figure 3: The bipartite graph representation of the movie
database shown in Figure 2. The graph shows all the possible
paths from the source nodes, Genre, to the destination
nodes, Age.



join. Conceptually, a join between two relations $A$ and $B$,
denoted $A \bowtie B$, combines all entries from $A$ and $B$ that share
common attribute values to return a composition of the
relations. For example, given $(i, j) \in A$, $(j, k) \in B$ and
$(j, k') \in B$, we have $(i, j, k) \in A \bowtie B$ and also
$(i, j, k') \in A \bowtie B$. The join operator is associative over a set
of relations, and its result explicitly represents all existing
paths between the occurring relations. For example, the natural
join of the three tables in Figure 2 returns a tuple for each path
between Genre and Age. For an ordered subset of binary
relations from the database $S \subseteq \{A_1, \ldots, A_n\}$, we use
$\bowtie S$ to denote the final join between all elements in $S$. The
order in $S$ ensures a join of consistent relations; we assume
that $S$ in $\bowtie S$ is always implicitly ordered.
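The join just described can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: it assumes relations are stored as Python sets of pairs, and the helper names `natural_join` and `join_all` are hypothetical.

```python
def natural_join(paths, relation):
    """Extend each path (a tuple) by every pair of `relation` whose first
    component equals the last element of the path."""
    return {p + (b,) for p in paths for (a, b) in relation if p[-1] == a}


def join_all(relations):
    """Join an ordered list of binary relations (sets of pairs) into paths."""
    paths = set(relations[0])
    for rel in relations[1:]:
        paths = natural_join(paths, rel)
    return paths


# The example from the text: (i, j) in A, (j, k) and (j, k') in B.
A = {("i", "j")}
B = {("j", "k"), ("j", "k'")}
joined = join_all([A, B])
print(sorted(joined))
```

Each resulting tuple is one path through the relations, which is why the join of a chain of relations enumerates all paths between its first and last attribute.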

A query $q$ is applied to the join of a subset of the relations
in the database $S \subseteq \{A_1, \ldots, A_n\}$. The result of a query
is denoted by $q(\bowtie S)$. We say that $S$ is the set of relations
occurring in the query. A query can be described with the
operators of projection and selection ifTsl, applied to a join
$\bowtie S$. Projection is a unary operator $\pi_X(\bowtie S)$ that restricts
the tuples of $\bowtie S$ to the attributes in $X$. Selection is a unary
operator $\sigma_\varphi(\bowtie S)$ where $\varphi$ is a propositional formula. The
operator selects all tuples in the relation $\bowtie S$ for which $\varphi$ holds.

Consider the movie database in Figure 2. A possible
query is: select drama movies and project movie and age
of its director. We can write this query as follows.

$$q_1 = \pi_{\text{Movie},\text{Age}}\big(\sigma_{\text{Genre}=\text{Drama}}(GM \bowtie MD \bowtie DA)\big)$$



The result of query $q_1$ is a set of pairs:
$\{(m_3, 30), (m_4, 30), (m_5, 30), (m_6, 60), (m_7, 60)\}$.
Another very similar query is: select drama movies and project
age only. That is,

$$q_2 = \pi_{\text{Age}}\big(\sigma_{\text{Genre}=\text{Drama}}(GM \bowtie MD \bowtie DA)\big)$$

Query $q_2$ returns $\{30, 60\}$. Although queries $q_1$ and $q_2$ are
very similar, the projection made by $q_2$ on Age only has
eliminated repeated values. The results of query $q_1$ tell us
how many paths there are between directors of Drama and
Age, while in query $q_2$ we only know whether a path exists or not.
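The two queries can be reproduced on a small reconstruction of the toy database. The tables below are an assumption: they are rebuilt from the query results and path counts quoted in the text (the movie names m1 and m2 for the romance movies are hypothetical), and the set-comprehension encoding is only one possible sketch of join, selection and projection.

```python
# Toy relations (assumed reconstruction of the example database).
GM = {("Romance", "m1"), ("Romance", "m2"),
      ("Drama", "m3"), ("Drama", "m4"), ("Drama", "m5"),
      ("Drama", "m6"), ("Drama", "m7"),
      ("History", "m6"), ("History", "m7")}
MD = ({(m, "C. Waitt") for m in ("m1", "m2", "m3", "m4", "m5")}
      | {(m, "T. George") for m in ("m6", "m7")})
DA = {("C. Waitt", 30), ("T. George", 60)}

# Natural join of the three relations: one tuple per path
# Genre -> Movie -> Director -> Age.
joined = {(g, m, d, a)
          for (g, m) in GM
          for (m2, d) in MD if m2 == m
          for (d2, a) in DA if d2 == d}

# Selection: keep the tuples with Genre = Drama.
drama = {t for t in joined if t[0] == "Drama"}

# Projections: q1 keeps (Movie, Age), q2 keeps Age only.
q1 = {(m, a) for (_, m, _, a) in drama}
q2 = {a for (_, _, _, a) in drama}

print(sorted(q1))
print(sorted(q2))
```

Because the results are sets, projecting on Age alone collapses the five drama paths into the two distinct ages, which is exactly the loss of path multiplicity discussed above.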

Our goal is to assess whether the results returned by a 
query provide significant information about our hypothesis 
on the data. For simplicity, a statistic $f$ is required to map
the results of a query to a single real value. We assume this
function $f$ is provided by the user together with the query;
together they define the hypothesis on the data the user wants
to test. Examples of this statistic are the average of the returned
results, or the number of tuples in the answer, but $f$
can be any general function returning a real value.

For example, the average value of Age in query $q_1$ is $(3 \cdot 30 + 2 \cdot 60)/5 = 42$
(i.e., the average age of directors of Drama weighted by the 
number of directed movies). Then, we may want to know 
whether that average age is interesting or not. Another two- 
tailed hypothesis is whether that average is significantly dif- 
ferent from the average age of directors of romance movies. 

Formally, our problem reads as follows. 

Problem 1 Given a set of binary relations $\{A_1, \ldots, A_n\}$
of structured data and a query $q$ on some occurring
$S \subseteq \{A_1, \ldots, A_n\}$, is the value of $f(q(\bowtie S))$, for a statistic $f$,
significant (in some sense to be made more specific later)?

3 Overview of the method 

In this section we present an overview of the approach and 
describe the intuition behind it. We show how our method 
can be used to test the significance of queries and to uncover 
the structurally important relations in the data. 

3.1 Significance testing via randomizations 

We approach the problem of testing the statistical signifi- 
cance of the query via randomizations. 

Randomizations have been widely used as a method to 
generate samples from null distributions. For example, in 
medical studies it is customary to measure the effect of a 







            (a) GM · MD · DA      (b) GM * MD * DA
              30    60              30    60
Romance        1     0               2     0
Drama          1     1               3     2
History        0     1               0     2

Figure 4: (a) Binary relation Genre x Age obtained via the
boolean product $GM \cdot MD \cdot DA$ of Figure 2. (b)
Contingency table of paths between Genre and Age
obtained via the matrix product $GM * MD * DA$.



certain drug via permutation tests between the control and 
case group [9].

For short, let $R = \bowtie S$ for some $S \subseteq \{A_1, \ldots, A_n\}$. To
assess the significance of $f(q(R))$, we generate randomized
versions of $R$ and run the same query over the samples. Let
$\mathcal{R} = \{\hat{R}_1, \ldots, \hat{R}_k\}$ be a set of randomizations of $R$. We
will specify in Section 4.1 how to generate such randomized
versions of $R$. Then the one-tailed empirical p-value of
$f(q(R))$ with the hypothesis of $f(q(R))$ being small is

$$\frac{|\{\hat{R} \in \mathcal{R} : f(q(\hat{R})) \le f(q(R))\}| + 1}{k + 1}. \qquad (1)$$



This definition represents the fraction of randomized samples
having a smaller (or equal) value of the statistic $f$. If the
p-value is small, e.g., below a threshold value $\alpha = 0.05$, we can say
that the value of $f(q(R))$ is significant in the original data.
The one-tailed p-value with the hypothesis of $f$ being large
and the two-tailed p-value are defined similarly.
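A minimal sketch of the one-tailed empirical p-value follows. It assumes the common form (count + 1)/(k + 1) with a non-strict comparison, so that samples tying with the original value count as at least as extreme; the function name `empirical_p_value` and its `tail` parameter are hypothetical.

```python
def empirical_p_value(original, randomized, tail="small"):
    """One-tailed empirical p-value of `original` against the statistic
    values computed on k randomized samples.

    tail="small" tests the hypothesis that the original value is small,
    tail="large" that it is large. Ties count as extreme (assumption).
    """
    if tail == "small":
        extreme = sum(1 for value in randomized if value <= original)
    elif tail == "large":
        extreme = sum(1 for value in randomized if value >= original)
    else:
        raise ValueError("tail must be 'small' or 'large'")
    return (extreme + 1) / (len(randomized) + 1)


# Half of the samples tie with the original value of 30.
p_ties = empirical_p_value(30, [30, 60, 30, 60], tail="small")
# No sample reaches the original value of 60.
p_rare = empirical_p_value(60, [30] * 999, tail="large")
print(p_ties, p_rare)
```

The added 1 in numerator and denominator treats the original data as one more sample, so the p-value can never be exactly zero.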

3.2 Where to randomize? 

The challenge is how to generate the set $\mathcal{R}$, that is, the
different randomized versions of $R = \bowtie S$, to compute the
empirical p-value. Consider the toy example in Figure 2. Suppose
we want to evaluate whether the average age of the directors
of drama movies, as in query $q_2$ of Section 2, is low.
A first naive approach is to consider randomizing directly
the binary matrix obtained from the boolean product of all
relations from Genre to Age. The boolean product tells us
whether there is a path from the set of nodes of Genre to the
set of nodes of Age, as required by query $q_2$.

A traditional permutation test¹ on this new matrix, shown
in Figure 4(a), can produce only two possible random
samples: either the original matrix, or a matrix where the age
values between Romance and History are swapped. For the
particular case of romance movies with the hypothesis of
having small age, we would obtain a p-value close to 0.5
(i.e. 50% of the randomized samples would have the same
value as the original). Thus the result is not significant. In- 



¹A traditional permutation test would swap any values in the matrix,
while keeping the row and column sums fixed. In binary data this is called
swap randomization.






Algorithm 1 Query significance in multi-relational data
Input: A set of binary relations $S \subseteq \{A_1, \ldots, A_n\}$, a query
$q(\bowtie S)$ and a hypothesis over the statistic $f(q(\bowtie S))$
Output: A set of p-values
1: for each binary relation $A \in S$ do
2:   Obtain $k$ random samples of $A$, $\mathcal{A} = \{\hat{A}_1, \ldots, \hat{A}_k\}$
3:   Let $\mathcal{R}_A = \{\bowtie T \cup \hat{A} \mid \hat{A} \in \mathcal{A} \text{ and } T = S \setminus A\}$
4:   Compute the p-value using the random samples $\mathcal{R}_A$
5: end for



deed, under such randomization none of the three genres 
would test significantly small, nor large, nor different. 

Alternatively, we could apply a permutation test on the
contingency table of paths |j4|, shown in Figure 4(b). This
table gives the number of paths between Genre and Age,
as required by $q_1$. The hypotheses related to our queries
would never be significant under those permutation tests.
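The difference between the two naive targets, the boolean product and the contingency table of paths, can be made concrete. The matrices below are an assumed reconstruction of the toy tables (rows of GM are Romance, Drama, History; columns of DA are ages 30 and 60), and the helpers `matmul` and `to_boolean` are hypothetical names for this sketch.

```python
# Assumed toy matrices: GM is genres x movies, MD is movies x directors,
# DA is directors x ages (30, 60).
GM = [[1, 1, 0, 0, 0, 0, 0],   # Romance: m1, m2
      [0, 0, 1, 1, 1, 1, 1],   # Drama:   m3..m7
      [0, 0, 0, 0, 0, 1, 1]]   # History: m6, m7
MD = [[1, 0]] * 5 + [[0, 1]] * 2   # m1..m5 one director, m6..m7 the other
DA = [[1, 0],
      [0, 1]]


def matmul(a, b):
    """Ordinary matrix product; entry (r, c) counts paths from r to c."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def to_boolean(m):
    """Boolean view of a path-count matrix: 1 where at least one path exists."""
    return [[int(v > 0) for v in row] for row in m]


paths = matmul(matmul(GM, MD), DA)   # contingency table of paths
reach = to_boolean(paths)            # boolean product of the chain
print(paths)
print(reach)
```

For 0/1 matrices, thresholding the matrix product gives the same result as chaining boolean products, so the boolean table is simply the path table with multiplicities discarded.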

The problem of these naive approaches is that they ignore 
the structure of the relations occurring in the query. In our 
toy example there are three binary relations participating in 
the query: GM, MD and DA. Indeed these relations
convey some structure on the data: the relation GM shows that
all history movies are also drama movies; the relation MD
shows that all movies from Drama and Romance have been 
directed by the same person. How do these structures affect 
the significance of the results in a query? 

In queries involving multiple binary relations, there is no 
unique way to randomize. To assess the structural effect that 
each relation from $S$ has over the query $q(\bowtie S)$, we should
randomize only the corresponding relation. That is, the
different randomizations of $\bowtie S$ are obtained by randomizing a
single relation $A \in S$ while keeping the rest fixed.

More formally, the random samples of $\bowtie S$, when only
$A \in S$ is randomized, are defined as follows:

$$\mathcal{R}_A = \{\bowtie T \cup \hat{A} \mid \hat{A} \in \mathcal{A} \text{ and } T = S \setminus A\},$$

where $\mathcal{A} = \{\hat{A}_1, \ldots, \hat{A}_k\}$ is the set of randomized versions
of the original $A \in S$. In Section 4.1 we describe the
different randomization techniques to obtain such samples.
Finally, these randomized samples $\mathcal{R}_A$ will be used to
compute the corresponding p-value, as described in Equation (1).

Observe that for a given query involving relations in S, 
we can obtain one p-value for each $A \in S$ we randomize
(while keeping S\A fixed). Each p-value is interesting as it 
measures the structural effect that the participant relation A 
has on the significance of the result of the query. 

The sketch of the method is described in Algorithm 1.
The basis of our proposal can be found in traditional
statistics under the name of restricted randomizations (see
e.g. [9]), typically used to test whether a treatment variable
has an effect on a response variable.
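The loop of Algorithm 1 can be sketched as follows. Several things here are assumptions rather than the authors' code: the toy tables are a reconstruction from the path counts quoted in the text, swap randomization uses ten attempted swaps per edge, ties count as extreme in the p-value, and the names `swap_randomize`, `mean_age` and `query_significance` are hypothetical.

```python
import random

GM = {("Romance", "m1"), ("Romance", "m2"), ("Drama", "m3"),
      ("Drama", "m4"), ("Drama", "m5"), ("Drama", "m6"),
      ("Drama", "m7"), ("History", "m6"), ("History", "m7")}
MD = ({(m, "C. Waitt") for m in ("m1", "m2", "m3", "m4", "m5")}
      | {(m, "T. George") for m in ("m6", "m7")})
DA = {("C. Waitt", 30), ("T. George", 60)}


def swap_randomize(relation, rng, attempts_per_edge=10):
    """Randomize a set of pairs by local swaps that keep all degrees fixed."""
    edges = sorted(relation)          # sorted for reproducibility
    present = set(edges)
    for _ in range(attempts_per_edge * len(edges)):
        x, y = rng.randrange(len(edges)), rng.randrange(len(edges))
        (i, j), (k, l) = edges[x], edges[y]
        # Rejects non-independent edges, including x == y, i == k, j == l.
        if (i, l) in present or (k, j) in present:
            continue
        present.difference_update({(i, j), (k, l)})
        present.update({(i, l), (k, j)})
        edges[x], edges[y] = (i, l), (k, j)
    return present


def mean_age(gm, md, da, genre):
    """Average director age over all paths starting at `genre`."""
    ages = [a for (g, m) in gm if g == genre
            for (m2, d) in md if m2 == m
            for (d2, a) in da if d2 == d]
    return sum(ages) / len(ages)


def query_significance(relations, statistic, k=999, tail="small", seed=0):
    """One empirical p-value per relation, randomizing one at a time."""
    rng = random.Random(seed)
    original = statistic(*relations)
    p_values = []
    for idx in range(len(relations)):
        hits = 0
        for _ in range(k):
            sample = list(relations)
            sample[idx] = swap_randomize(relations[idx], rng)
            value = statistic(*sample)
            hits += (value <= original) if tail == "small" else (value >= original)
        p_values.append((hits + 1) / (k + 1))
    return p_values


# Hypothesis: history movies are directed by old directors.
p_gm, p_md, p_da = query_significance(
    [GM, MD, DA],
    lambda gm, md, da: mean_age(gm, md, da, "History"),
    tail="large")
print(p_gm, p_md, p_da)
```

Only the relation at position `idx` is replaced in each sample while the others stay fixed, so each of the three p-values isolates the structural effect of one relation on the query.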



3.3 Example 

We now study the toy example in Figure 2. Consider a query
defined as $q_1$ from Section 2, yet on the three different
genres.

The first hypothesis that romance movies are directed by 
young directors obtains a p-value of 0.131 when random- 
izing on GM, a p-value of 0.494 on MD, and a p-value of 
0.495 on DA. The hypothesis is not significant under any
randomization, but we observe that randomizing on GM ob- 
tains the smallest p-value for this query. 

The hypothesis that history movies are directed by old di- 
rectors obtains p-values 0.269, 0.045, 0.495 when random- 
izing on GM, MD, DA, respectively. Thus the hypothesis 
is significant considering the structure in relation MD: all 
non-history movies are directed by the same person. 

Finally, the hypothesis that drama movies are directed by 
young directors is not significant in any of the randomiza- 
tions, always with a p-value close to 1 when randomizing 
on GM or MD, and p-value of 0.495 when randomizing on 
DA. 

In summary: the age value of 30 associated to romance 
movies is close to being significant when randomizing on 
GM because Romance is a non-intersecting genre with 
Drama and History; the age value of 60 associated to history 
movies is significant when randomizing on MD because, 
when focusing on the directors, the history movies are non- 
intersecting with the romance and drama movies — all ro- 
mance and drama movies are directed by the same person; 
also, the relation DA always swaps with equal probability, 
because of its one-to-one structure. The next section
explains these observations in more detail.

4 Randomizations in multi-relational 
model 

This section describes how to obtain random samples for a
single relation $A$ (line 2 in Algorithm 1), and presents the
combinatorial properties of combining such samples with
the other relations in the query (line 3 in Algorithm 1).

4.1 Types of randomization 

Given a binary relation A we use three different types of ran- 
domization to obtain random samples from A. The running 
times and space consumptions of the methods are linear in 
the size of the relation A. 

(1) Swap randomization of $A$, as used in [5, 8], produces
random samples of $A$ that preserve the row and column
sums. The algorithm starts from $A$ and performs
local swaps interchanging a pair of 1's with a pair of 0's,
preserving the row and column sums. Technically, a local
swap consists of selecting entries $(i, j), (k, l) \in A$
such that $(i, l), (k, j) \notin A$, and swapping the elements
so that $(i, j), (k, l) \notin A$ and $(i, l), (k, j) \in A$. On the
bipartite graph representation of the relation $A$, a local
swap represents a flip between two independent edges:

    i -- j, k -- l   becomes   i -- l, k -- j

A sequence of swaps is performed until the data mixes
sufficiently in a Markov chain approach [HE],
and therefore a random sample of $A$ is obtained. We
use ten times the number of ones in the matrix as the
number of swaps, which suffices for the convergence of
the chain [8]. We denote the set of all random samples
reached via swap randomization of $A$ as $\mathrm{sw}(A)$.

(2) Row permutation of $A$ permutes the order of the rows
of $A$. We denote the set of all random samples reached
via row permutation of $A$ as $\mathrm{rp}(A)$.

(3) Column permutation of $A$ permutes the order of the
columns of $A$. We denote the set of all random samples
reached via column permutation of $A$ as $\mathrm{cp}(A)$.

Note, particularly, that $\mathrm{sw}(A)$, $\mathrm{rp}(A)$ and $\mathrm{cp}(A)$ refer to sets
of matrices. The relationship between swap randomizations
and permutations can be stated as follows.

Proposition 1 Let $A$ be a binary matrix. Then:

• $\mathrm{rp}(A) = \mathrm{sw}(I) \cdot A$, where $I$ is an identity matrix;

• $\mathrm{cp}(A) = A \cdot \mathrm{sw}(I)$, where $I$ is an identity matrix;

• if $A$ has one 1 in each row, then $\mathrm{sw}(A) = \mathrm{rp}(A)$; if $A$
has one 1 in each column, then $\mathrm{sw}(A) = \mathrm{cp}(A)$.

Note that $\mathrm{sw}(I)$, for an identity matrix $I$, can produce any
permutation matrix with uniform distribution. Thus,
the boolean product $\mathrm{sw}(I) \cdot A$ produces all
permutations of the rows of $A$ and, similarly, $A \cdot \mathrm{sw}(I)$
produces all permutations of the columns of $A$. Intuitively,
these row (or column) permutations can be seen as a
random re-assignment of the row (or column) names in $A$.
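Proposition 1 can be checked by brute force on small matrices. The sketch below enumerates every matrix reachable by local swaps and compares the result with the sets of row and column permutations; the function names `one_swap`, `sw`, `rp` and `cp` are chosen here to mirror the notation and are not taken from any library.

```python
from itertools import permutations


def one_swap(m):
    """All matrices reachable from m (tuple of tuples) by one local swap."""
    out = set()
    rows, cols = len(m), len(m[0])
    for i in range(rows):
        for k in range(rows):
            for j in range(cols):
                for l in range(cols):
                    if m[i][j] and m[k][l] and not m[i][l] and not m[k][j]:
                        new = [list(r) for r in m]
                        new[i][j] = new[k][l] = 0
                        new[i][l] = new[k][j] = 1
                        out.add(tuple(map(tuple, new)))
    return out


def sw(m):
    """Set of all matrices reachable from m by sequences of local swaps."""
    start = tuple(map(tuple, m))
    seen, frontier = {start}, [start]
    while frontier:
        for nxt in one_swap(frontier.pop()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


def rp(m):
    """Set of all row permutations of m."""
    return {tuple(map(tuple, p)) for p in permutations(m)}


def cp(m):
    """Set of all column permutations of m."""
    return {tuple(zip(*p)) for p in permutations(list(zip(*m)))}


A = [[1, 0], [1, 0], [0, 1]]            # exactly one 1 in each row
I3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # identity matrix

print(sw(A) == rp(A), len(sw(I3)))
```

For the matrix with a single 1 per row, every swap amounts to exchanging two rows, which is why the swap-reachable set coincides with the row permutations but not with the column permutations.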

While swap randomization has been used in [8] to
assess data mining results on a single binary relation, the
new randomizations, corresponding to row and column per- 
mutations, do not make sense in such a context. The row 
or column permutation of a matrix does not change any of 
the frequent pattern solutions in the new randomized matrix. 
These permutations only make sense in a multi-relational 
data model, where the permuted matrices are combined with 
other relations. Both row and column permutation of a sin- 
gle relation change the global paths from the source nodes 
to destination nodes in the query graph, and thus, the evalu- 
ation of the query can change on the randomized data. 



4.2 Properties 

Next we study the properties of combining the obtained
random samples with the other relations in the query. For
simplicity, we study the case of queries with only two occurring
relations, $q(A \bowtie B)$, and use the boolean product as a
simplification of the natural join. For notational convenience, we
overload the boolean product for sets of binary matrices,
e.g., $\mathrm{sw}(A) \cdot \mathrm{sw}(B)$ represents the boolean product of each
pair of elements $\hat{A} \in \mathrm{sw}(A)$ and $\hat{B} \in \mathrm{sw}(B)$.

The following inclusions with swap randomization follow
immediately from the definitions. No other inclusions hold
in general, and each of the inclusions can be proper.

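Proposition 2 Let $A$, $B$ be binary matrices. Then:

• $A \cdot B \subseteq \mathrm{sw}(A) \cdot B \subseteq \mathrm{sw}(A) \cdot \mathrm{sw}(B)$;

• $A \cdot B \subseteq A \cdot \mathrm{sw}(B) \subseteq \mathrm{sw}(A) \cdot \mathrm{sw}(B)$;

• $A \cdot B \subseteq \mathrm{sw}(A \cdot B)$.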

Proposition 2 tells us that the set of samples that can be
obtained by randomizing two relations is larger than by
randomizing only one relation. As discussed in Section 3.2,
we prefer to randomize a single table at a time in order to
better control the structural effect the randomized
relation has on the query. Additionally, we know that the set
of randomized samples $\mathrm{sw}(A) \cdot B$ is different from the set
$A \cdot \mathrm{sw}(B)$, thus it makes sense to do them both separately.

Next we present several properties relating swap random- 
ization to row and column permutations. 

Proposition 3 Let $A$, $B$ be binary relations. If $B$ is a
one-to-one relation, then $A \cdot \mathrm{sw}(B) = \mathrm{cp}(A)$. If $A$ is a
one-to-one relation, then $\mathrm{sw}(A) \cdot B = \mathrm{rp}(B)$.

Proposition 3 follows immediately from Proposition 1. In
real world datasets, it is quite common to have one-to-one
relations. For example, the ages of the directors in the
example in Figure 2 are one-to-one. Thus swap randomization
of the relation DA produces the same set of samples as the
column permutation of MD.

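Proposition 4 Let $A$, $B$ be binary relations. Then:

• $\mathrm{cp}(A \cdot B) = A \cdot \mathrm{cp}(B)$

• $\mathrm{rp}(A \cdot B) = \mathrm{rp}(A) \cdot B$

• $\mathrm{cp}(A) \cdot B = A \cdot \mathrm{rp}(B) = \mathrm{cp}(A) \cdot \mathrm{rp}(B)$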

This means that column and row permutations do not make
sense in more than one relation, e.g.,
$A \cdot \mathrm{cp}(B \cdot C \cdot D) \cdot E = A \cdot B \cdot C \cdot \mathrm{cp}(D) \cdot E$.
The last property of Proposition 4
states that only one permutation, either column permutation
on $A$ or row permutation on $B$, is indeed necessary.
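The last property can be verified exhaustively on a small pair of matrices. This is a sketch under stated assumptions: the matrices `A` and `B` are arbitrary illustrative inputs, and `bool_product` is a hypothetical helper implementing the boolean product on lists of 0/1 rows.

```python
from itertools import permutations


def bool_product(a, b):
    """Boolean matrix product, returned as a hashable tuple of tuples."""
    return tuple(
        tuple(int(any(x and y for x, y in zip(row, col))) for col in zip(*b))
        for row in a)


A = [[1, 1, 0],
     [0, 0, 1]]        # 2 x 3
B = [[1, 0],
     [1, 0],
     [0, 1]]           # 3 x 2
n = len(B)             # size of the shared index set

# cp(A) . B : permute the columns of A, keep B fixed.
col_perm_A = {
    bool_product([[row[p] for p in perm] for row in A], B)
    for perm in permutations(range(n))}

# A . rp(B) : keep A fixed, permute the rows of B.
row_perm_B = {
    bool_product(A, [B[p] for p in perm])
    for perm in permutations(range(n))}

print(col_perm_A == row_perm_B)
```

Permuting the columns of `A` by some permutation has the same effect on the product as permuting the rows of `B` by its inverse, so the two sets of products coincide once all permutations are considered.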

Finally, we give an implication of Proposition 1 that
reduces the number of different randomizations considerably.

Theorem 1 Let $A$, $B$ be binary relations. Then:
$A \cdot \mathrm{sw}(I) \cdot B = \mathrm{cp}(A) \cdot B = A \cdot \mathrm{rp}(B)$, where $I$ is an identity matrix.

Hence, we prefer to use the notation with the identity matrix
$I$ to refer to the row and column permutations. The
operation $A \cdot \mathrm{sw}(I) \cdot B$ randomizes the boolean product, whereas
the operations $\mathrm{sw}(A) \cdot B$ and $A \cdot \mathrm{sw}(B)$ randomize the
original data. From this perspective, $A \cdot \mathrm{sw}(I) \cdot B$ tells about the
significance of the combination operation, while $\mathrm{sw}(A) \cdot B$
tells whether the structure in $A$ is significant. To sum up, we
have the following result:

Corollary 1 For a query $q(A \bowtie B)$, there exist three
different randomizations: (i) $\mathrm{sw}(A)$ while keeping $B$ fixed; (ii)
$\mathrm{sw}(B)$ while keeping $A$ fixed; (iii) $\mathrm{sw}(I)$ where $I$ is an
identity relation between the columns of $A$ and the rows of $B$.

Notice that if $A$ or $B$ is a one-to-one relation, then
randomization (iii) will be the same as (i) or (ii), respectively. Each
randomization provides a set of samples from where we can 
compute a p-value for our query (hypothesis on the data). 
Every p-value is interesting as it shows how the structure of 
the randomized relation affects the significance. 

4.3 Example revisited 

The p-values reported in Section 3.3 for the toy example
in Figure 2 correspond to swap randomization of the binary
tables GM, or MD, or DA, respectively. Indeed, because MD
has one single 1 in each row, we have that
$GM \cdot \mathrm{sw}(I) \cdot MD \cdot DA$ is equal to $GM \cdot \mathrm{sw}(MD) \cdot DA$.
Similarly, because DA is a one-to-one relation, we have that
$GM \cdot MD \cdot \mathrm{sw}(I) \cdot DA$ equals $GM \cdot MD \cdot \mathrm{sw}(DA)$.
Thus, for this example, only
swap randomization in the three tables is necessary.

Interestingly, we can now better understand the p-values
reported in Section 3.3. On the relation GM, drama movies
and history movies have no independent edges to swap be- 
tween them. Therefore, the pattern of History implying 
Drama tends to remain in random samples. As a result, the 
p-value of the hypothesis related to history or drama movies 
is not significant. On the other hand, the p-value related to 
romance movies becomes close to being significant because, 
for this genre, the null distribution diverges more from the 
original. The fact that there are only two romance movies 
raises this p-value slightly above the 0.05 threshold. 

A similar explanation applies when randomizing MD. When
looking at MD, local swaps can interchange at most two 
edges between movies of the young director C. Waitt and 
movies of the not-so-young director T. George. Actually, 
in all random samples coming from MD we observe that 
C. Waitt has always at least three movies from either drama 
or romance. As a result, neither drama nor romance can be 
significant — in the null distribution they are always closely 
linked to a young director as in the original data. Yet, his- 
tory movies directed by T. George have more local swaps 
that would create a diverging null distribution — most of the 



samples in the null distribution have the history movies con- 
nected to the age of 30. The hypothesis of history movies 
being directed by a not-so-young person is then significant. 

5 Studying path distributions 

For a query $q(A \bowtie \cdots \bowtie B)$ where $A \subseteq I \times L$ and $B \subseteq J \times K$,
let $P = A * \cdots * B$ be the matrix product of all relations
participating in $q$. This corresponds to the contingency table
of paths from origin nodes $I$ to destination nodes $K$. An example
is shown in Figure 4(b) for the toy data of Figure 2. For
all types of queries, the significance of the result is closely 
related to the path distributions between nodes $I$ and $K$. For
example, suppose we want to test whether the average age
of history-movie directors is large. In the original data of
Figure 3 there are two paths from History to the age of 60
and no path to the age of 30. It is sensible to assume that 
if we had random samples where paths are mainly swapped 
the other way round, the hypothesis would be significant. 

Naturally, a simple way to visualize whether there ex- 
ists an interesting finding in the data is to compare the path 
distribution of P with the expected path distribution on the 
given random samples. The larger the change, the more sig- 
nificant the result would tend to be. 

The following three matrices show the expectation of the
paths when swap randomizing relation GM, MD or DA,
respectively, for the example in Figure 2.

$$E[\mathrm{sw}(GM) * MD * DA] = \begin{pmatrix} 0.849 & 1.151 \\ 3.269 & 1.731 \\ 0.882 & 1.118 \end{pmatrix}$$

$$E[GM * \mathrm{sw}(MD) * DA] = \begin{pmatrix} 1.413 & 0.587 \\ 3.587 & 1.413 \\ 1.455 & 0.545 \end{pmatrix}$$

$$E[GM * MD * \mathrm{sw}(DA)] = \begin{pmatrix} 0.984 & 1.016 \\ 2.492 & 2.508 \\ 1.016 & 0.984 \end{pmatrix}$$

The genre that swaps most of its paths under randomiza- 
tions with GM is Romance. History swaps the paths from 
the age of 60 to the age of 30 when randomizing on MD. 
Randomization on DA distributes paths fifty-fifty for each 
genre. The p-values obtained there were always close to 0.5.
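For the randomization of MD, the expected path counts can also be computed exactly rather than estimated from samples. The sketch below relies on two assumptions: the toy tables are a reconstruction (the younger director has five movies, the older one has m6 and m7), and, since MD has a single 1 in each row, swap randomization is treated as a uniform choice of which two movies the older director gets, following Proposition 1.

```python
from fractions import Fraction
from itertools import combinations

movies = ["m1", "m2", "m3", "m4", "m5", "m6", "m7"]
genre_movies = {
    "Romance": ["m1", "m2"],
    "Drama": ["m3", "m4", "m5", "m6", "m7"],
    "History": ["m6", "m7"],
}

# Every way of giving two of the seven movies to the director aged 60;
# each assignment is assumed equally likely under the null distribution.
assignments = list(combinations(movies, 2))

expected = {}
for genre, ms in genre_movies.items():
    total_to_60 = sum(sum(m in old for m in ms) for old in assignments)
    to_60 = Fraction(total_to_60, len(assignments))
    expected[genre] = (len(ms) - to_60, to_60)   # (paths to 30, paths to 60)

for genre, (to_30, to_60) in expected.items():
    print(genre, float(to_30), float(to_60))
```

The exact values, 10/7 and 4/7 for Romance and History and 25/7 and 10/7 for Drama, are close to the sampled matrix for MD reported above, and they show how far the original two History paths to age 60 lie from their null expectation.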

6 Empirical results 

In this section we present empirical results on synthetic and 
real datasets. Our real dataset is MovieLens, which is very 
similar to IMDb. In all cases, we calculate the empirical
p-values over 999 randomized samples and use the threshold
of $\alpha = 0.05$ to determine the query significance.

The randomization methods are fast in practice. In our 
experiments, producing one randomized sample took ap- 
proximately the same time as evaluating the query. With 
the tested datasets, the times for producing one sample were 
at most a few seconds with Java implementations integrated
with MATLAB on a 2.2GHz Opteron. The time and space 
consumption of the methods scale linearly in the size of the 
relation. In large-scale applications, a smaller number of
randomized samples can be used to calculate the empirical
p-values. For example, 30 samples are usually sufficient in a
preliminary significance analysis. This corresponds to an
approximately 30-fold increase in the evaluation time.
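The choice of sample count bounds the smallest p-value that can be observed. As a worked check, assuming the empirical p-value has the (count + 1)/(k + 1) form of Equation (1), the minimum is reached when no randomized sample is as extreme as the original:

```latex
p_{\min}(k) = \frac{0 + 1}{k + 1}, \qquad
p_{\min}(999) = \frac{1}{1000} = 0.001, \qquad
p_{\min}(30) = \frac{1}{31} \approx 0.032 < 0.05
```

With 30 samples a result can therefore still fall below the 0.05 threshold, but only when none of the samples ties with or exceeds the original value.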

6.1 Synthetic dataset 

To motivate our approach and understand better why ran- 
domizations are consistent with the inferences about our 
hypothesis, we generate a synthetic dataset to simulate re- 
lations of users, movies and genres. We will be interested in 
testing the following hypothesis. 

Hyp 1 Men watch different types of movies than women. 

The relations occurring in the query are: Gender x User
(SU), User x Movie (UM) and Movie x Genre (MG).

For studying the behavior of randomizations, we generate 
the tables SU, UM and MG to make our hypothesis clearly 
be significant. We let SU contain 30 men and 20 women, 
thus SU is a 2 x 50 binary table where the first 30 values in 
the first row and the last 20 values in the second row are 1s.
We generate UM to be a 50 x 100 binary table where men 
watch any of the first 60 movies with probability of 0.40 and 
any of the last 40 movies with probability of 0.05. To create 
a strong pattern, we let the probabilities of a female watch- 
ing movies be the other way round. Finally, we generate 
MG as a 100 x 6 binary table where the first three genres
will be considered to be manly and the last three genres will 
be considered to be womanly. For each movie in the rela- 
tion, we select two genres as follows: for the first 60 movies 
we select a genre from the manly genres with a probability 
of 0.9 and from the womanly genres with a probability of 
0.1. For the last 40 movies the probabilities are the other
way round. So, each movie has at most two genres, because 
if we happen to select the same genre for a movie twice, 
then we say that the movie has only one genre. 

Next we create the anti-tables corresponding to those above,
called rSU, rUM and rMG. These anti-tables contain no
structure at all; they are random. We let rSU be a 2 x 50
binary table with 30 men and 20 women where the order of the
users is random. We generate rUM to be a 50 x 100 table with
each element being 1 with probability (0.40 + 0.05)/2. We
let rMG be formed similarly to MG but with the two genres
for each movie assigned uniformly with replacement.
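
The generation procedure described above can be sketched in
Python as follows. This is only an illustration under stated
assumptions: the function names and the fixed seed are
hypothetical, and within the manly or womanly group the genre
is assumed to be drawn uniformly, which the description does
not specify.

```python
import random


def make_gender_user(n_men=30, n_women=20, shuffle=False, rng=random):
    """2 x (n_men + n_women) binary table; row 0 = men, row 1 = women."""
    genders = [0] * n_men + [1] * n_women
    if shuffle:                       # anti-table rSU: random user order
        rng.shuffle(genders)
    return [[1 if g == row else 0 for g in genders] for row in (0, 1)]


def make_user_movie(n_men=30, n_women=20, n_first=60, n_last=40,
                    p_high=0.40, p_low=0.05, structured=True, rng=random):
    """Users x movies binary table; unstructured gives the anti-table rUM."""
    flat = (p_high + p_low) / 2
    table = []
    for user in range(n_men + n_women):
        is_man = user < n_men
        row = []
        for movie in range(n_first + n_last):
            in_first = movie < n_first
            if not structured:
                p = flat
            elif is_man == in_first:  # men/first block or women/last block
                p = p_high
            else:
                p = p_low
            row.append(1 if rng.random() < p else 0)
        table.append(row)
    return table


def make_movie_genre(n_first=60, n_last=40, n_genres=6,
                     p_own=0.9, structured=True, rng=random):
    """Movies x genres binary table; unstructured gives the anti-table rMG."""
    half = n_genres // 2
    table = []
    for movie in range(n_first + n_last):
        row = [0] * n_genres
        for _ in range(2):            # two draws; a repeat leaves one genre
            if structured:
                prefers_first_group = movie < n_first
                use_first_group = (rng.random() < p_own) == prefers_first_group
                # Assumption: uniform choice inside the selected group.
                genre = (0 if use_first_group else half) + rng.randrange(half)
            else:
                genre = rng.randrange(n_genres)
            row[genre] = 1
        table.append(row)
    return table


rng = random.Random(0)                # fixed seed, chosen arbitrarily
SU = make_gender_user(rng=rng)
UM = make_user_movie(rng=rng)
MG = make_movie_genre(rng=rng)
rSU = make_gender_user(shuffle=True, rng=rng)
rUM = make_user_movie(structured=False, rng=rng)
rMG = make_movie_genre(structured=False, rng=rng)
```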

The goal of this experiment is to study how the p-values of
Hyp 1 change when the original significant tables SU, UM and
MG are combined with one of these non-significant tables.
Figure 5 shows the contingency table of paths for those
combinations. We notice that using the original tables SU,
UM and MG (Figure 5(a)) clearly produces a significant
difference between the types of movies that males and
females watch. When one of the original tables is replaced
with a random version, the pattern seems to disappear.
Still, we cannot clearly see from the path distributions
which of the




[Figure 5 panels, each with columns G1 to G6: (a) SU*UM*MG,
(b) rSU*UM*MG, (c) SU*rUM*MG, (d) SU*UM*rMG]

Figure 5: Proportion of paths going from a gender (M = male,
F = female) to a genre (G1-G6) in the different combined
tables. A lighter color represents fewer paths and a darker
color more paths; more exactly, white corresponds to the
lowest value of 4.5% and black to the highest value of 30%.



Input relations     p-values
A     B     C       sw(I_AB)   sw(B)    sw(I_BC)   sw(C)
SU    UM    MG      0.001      0.001    0.001      0.001
rSU   UM    MG      0.517      0.030    0.013      0.003
SU    rUM   MG      0.282      0.279    0.155      0.124
SU    UM    rMG     0.001      0.001    0.704      0.727

Table 1: Significance tests for Hyp 1 with the combined
input relations A ⋈ B ⋈ C. The first three columns contain
the relations considered as input, labeled A, B and C.
Columns 4 to 7 give the empirical p-values for the
hypothesis when only one relation is randomized: sw(I_AB)
randomizes the identity matrix between relations A and B,
which is equivalent to randomizing the relation A, sw(A);
sw(B) randomizes only relation B; sw(I_BC) randomizes the
identity matrix between relations B and C; sw(C) randomizes
only relation C. Bold p-values correspond to randomizations
which touch the anti-tables.

underlying tables mainly breaks the original structure. We 
would like to check with our tests whether randomizing in 
the proper tables will tell us where the pattern is broken. 
For the test, we use the following statistic. 

Statistic 1 L1 distance between the distribution of genres
of the movies that men and women have watched.

This statistic is the sum of the absolute differences between
the proportion of paths of men and women, as shown in
Figure 5 for each of the combinations. The original value of
the statistic with the tables SU, UM and MG is 1.23, implying
a clear difference between males and females. When one of
the tables SU, UM and MG is replaced with a corresponding
anti-table, the value of the L1 statistic is around 0.1.
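
To make the statistic concrete, the sketch below computes it
from three binary tables in plain Python. It assumes that the
number of paths from a gender to a genre is the corresponding
entry of the matrix product SU * UM * MG and that proportions
are taken per gender (each row normalized to sum to one). The
helper names and the toy tables are hypothetical.

```python
def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns]
            for row in a]


def l1_statistic(su, um, mg):
    """L1 distance between the genre distributions of the two genders.

    Assumes each gender has at least one path to some genre, so that
    the row totals used for normalization are non-zero.
    """
    paths = matmul(matmul(su, um), mg)      # 2 x genres table of path counts
    shares = []
    for row in paths:
        total = sum(row)
        shares.append([count / total for count in row])
    men, women = shares
    return sum(abs(m - w) for m, w in zip(men, women))


# Toy example: one man, one woman, three movies, two genres.
toy_su = [[1, 0], [0, 1]]
toy_um = [[1, 1, 0], [0, 1, 1]]
toy_mg = [[1, 0], [1, 0], [0, 1]]
toy_value = l1_statistic(toy_su, toy_um, toy_mg)
```

In the toy example the man has two paths, both to the first
genre, and the woman has one path to each genre, so the shares
are (1, 0) and (0.5, 0.5) and the statistic is 1.0. The largest
possible value is 2, reached when the two distributions do not
overlap at all.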

In Table 1 we show the results of the significance tests for
Hyp 1 on the combined tables. There is a clear connection
between the structure of the relations A, B and C occurring
in the query and the p-values obtained by randomizing in
different relations. As expected, the empirical p-value of
Hyp 1 with tables SU, UM and MG is significant with
randomizations in all tables. On the other hand, when one of
the clearly structured tables SU, UM or MG is replaced by
the anti-table rSU, rUM or rMG, respectively, we obtain
large empirical p-values for those randomizations that touch
the anti-tables (see the bold values of Table 1). This
illustrates how randomizations can reveal the structural
effects behind the significance of a query.

Relation   Description         Rows   Cols   # of 1's/row
UM         User x Movie        943    1680   106
MG         Movie x Genre       1680   18     1.7
UO         User x Occupation   943    21     1
US         User x Gender       943    2      1
UA         User x Age          943    943    1

Table 2: Summary of tables in MovieLens dataset. The table
UA is an identity map between users and their ages. We
denote a transpose by reversing the relation name.

6.2 MovieLens dataset 

The MovieLens data is collected through the MovieLens web
site (movielens.umn.edu). The downloadable data is already
cleaned up, i.e., users who had fewer than 20 ratings or did
not have complete demographic information were removed from
the data set. In all, the data consists of 100,000 ratings
(valued from 1 to 5) from 943 users on 1,682 movies. Each
user has rated at least 20 movies, and the demographic
information for the users corresponds to the attributes age,
gender, occupation and zip code. For each movie we have the
title, release year and a list of genres. Furthermore, we
interpret a rating as meaning that the user has watched the
movie. This corresponds to the binary table named UM. We do
not use the rating information in any other way. In Table 2
we summarize the binary relations in the MovieLens dataset.
The table UA is just an identity matrix which maps the users
to their ages; thus two different columns of the table UA
may correspond to the same age. Handling numerical values in
this way guarantees that two users having the same age are
not combined into a single user after a join and a
projection.

Next, we go through a few queries on the dataset and analyze
their significance.

Hyp 2 Men watch different types of movies than women.

Statistic 2 L1 distance between the distribution of genres
of the movies that men and women have watched.

In Table 3 we give the empirical p-values for Hyp 2. Each
row shows the relation being randomized for obtaining the
corresponding p-value. The query associated with the
hypothesis traverses the relations Gender x User x Movie x





                         Mean   (Std)    p-value
SU ⋈ UM ⋈ MG             0.16
sw(SU) ⋈ UM ⋈ MG         0.03   (0.01)   0.001
SU ⋈ sw(I) ⋈ UM ⋈ MG     0.03   (0.01)   0.001
SU ⋈ sw(UM) ⋈ MG         0.01   (0.00)   0.001
SU ⋈ UM ⋈ sw(I) ⋈ MG     0.03   (0.01)   0.001
SU ⋈ UM ⋈ sw(MG)         0.02   (0.00)   0.001

Table 3: Significance evaluation of Hyp 2. Mean and std
are the average and standard deviation of Statistic 2 in the
original input data (first row) and several randomizations.

Genre, corresponding to relations SU, UM and MG. There are
five different types of randomizations of the query, each of
which produces its own p-value. The results in Table 3 show
that Hyp 2 is significant with respect to all the different
randomizations.
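
As an illustration of what a single randomization point may
look like, the sketch below assumes that sw(.) denotes swap
randomization of a binary table, that is, repeated swaps that
keep all row and column sums fixed. This reading, the function
name and the number of attempted swaps are assumptions made
for the example, not a description of the implementation used
in the experiments.

```python
import random


def swap_randomize(table, attempts, rng=random):
    """Return a copy of a binary table after `attempts` attempted swaps.

    A swap picks two 1-entries (r1, c1) and (r2, c2); if (r1, c2) and
    (r2, c1) are both 0, the two ones are moved there. Row and column
    sums are unchanged by construction.
    """
    result = [row[:] for row in table]
    ones = [(r, c) for r, row in enumerate(result)
            for c, value in enumerate(row) if value]
    if len(ones) < 2:
        return result
    for _ in range(attempts):
        i, j = rng.randrange(len(ones)), rng.randrange(len(ones))
        (r1, c1), (r2, c2) = ones[i], ones[j]
        # Entries sharing a row or a column fail this check and are skipped.
        if result[r1][c2] == 0 and result[r2][c1] == 0:
            result[r1][c1] = result[r2][c2] = 0
            result[r1][c2] = result[r2][c1] = 1
            ones[i], ones[j] = (r1, c2), (r2, c1)
    return result


# Swapping inside an identity matrix yields a permutation matrix.
rng = random.Random(1)                # fixed seed, chosen arbitrarily
identity = [[1 if r == c else 0 for c in range(5)] for r in range(5)]
shuffled_identity = swap_randomize(identity, attempts=100, rng=rng)
```

Under this reading, randomizing the identity matrix between two
relations re-pairs the rows of one relation with the columns of
the next while leaving both relations themselves intact.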

Indeed, the results on Hyp 2 seem to indicate that men watch
movies of different genres than women. All randomizations
are consistent. We will next analyze which genres separate
men and women. We repeat the following hypothesis (with the
associated query) for each genre G.

Hyp 3 Men watch genre G more (or less) than women. 

Statistic 3 The difference between the %-proportions of the 
movies from genre G among all the movies men and women 
have watched. 

Notice that this statistic is similar to Statistic 2, but now
we only look at the difference for the specific genre G. The
empirical p-values of the significance tests of Hyp 3 are
given in Table 4. Again we find that randomizing in
different relations produces fairly similar results in
general. We observe that men watch significantly more, for
example, action and sci-fi movies than women, whereas women
watch significantly more romance and drama movies than men.
Interestingly, the popularity of mystery and documentary
movies does not appear to depend on gender. In fact, the
genres with the fewest movies tend to be the least
significant ones. The genres with the fewest movies are
fantasy (with 22 movies), film-noir (24), western (27),
animation (41) and documentary (50).
Next we study users by their occupation.

Hyp 4 The users with occupation O watch different types 
of movies than other users. 

Statistic 4 L1 distance between the distributions of genres
of the movies watched by users with occupation O and users
with other occupations.

The results of the significance tests are given in Table 5.
When evaluating the associated query, we find that it
matters which relation is randomized. For most of the
occupations, Hyp 4 is not significant when randomizing with
sw(OU) ⋈ UM ⋈ MG or with OU ⋈ sw(I_1) ⋈ UM ⋈ MG.

G            Orig.   sw(SU)   sw(I_1)   sw(UM)   sw(I_2)   sw(MG)
Action        2.5    0.001    0.001     0.001    0.001     0.001
Sci-fi        1.5    0.001    0.001     0.001    0.001     0.001
Thriller      1.1    0.001    0.001     0.001    0.001     0.001
Adventure     0.8    0.001    0.001     0.001    0.001     0.001
Crime         0.6    0.002    0.001     0.001    0.001     0.002
War           0.5    0.002    0.001     0.001    0.004     0.002
Horror        0.4    0.019    0.018     0.001    0.011     0.020
Western       0.2    0.001    0.001     0.001    0.005     0.003
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Film-noir     0.1    0.012    0.009     0.001    0.054     0.058
Mystery       0.0    0.392    0.401     0.395    0.424     0.469
Document.     0.0    0.404    0.392     0.391    0.468     0.489
Fantasy      -0.1    0.064    0.070     0.051    0.243     0.201
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Animation    -0.2    0.032    0.033     0.001    0.027     0.018
Musical      -0.5    0.001    0.001     0.001    0.001     0.001
Children's   -1.0    0.001    0.001     0.001    0.001     0.001
Comedy       -1.3    0.001    0.001     0.001    0.001     0.001
Drama        -2.3    0.001    0.001     0.001    0.001     0.001
Romance      -2.3    0.001    0.001     0.001    0.001     0.001

Table 4: Empirical p-values for Hyp 3. The values of the
associated Statistic 3 in the original relations are given
in the second column. The different randomization methods
(columns 3 to 7) correspond to randomizing in one relation
at a time from SU ⋈ I_1 ⋈ UM ⋈ I_2 ⋈ MG. Genres are sorted
by the value of the statistic. The significance tests say:
genres above the first dashed line are watched more by men
(p-values always under 0.05); genres below the second dashed
line are watched more by women (p-values always under 0.05).
We cannot say anything about the genres in between the two
dashed lines.

For the other randomizations, all occupations except
homemakers exhibit significance of the hypothesis. We
observe that the largest occupation groups, librarians (51),
educators (95) and students (196), have the most significant
empirical p-values for the query with all types of
randomizations. We could infer that those types of users
watch different genres than other users.

Hyp 5 Average age of the users who have watched movies 
of a given genre is significant. 

Statistic 5 Weighted average age of the users who have 
watched movies of the given genre. 
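
A minimal sketch of this statistic is given below. It assumes
that the weight of a user is the number of movies of the given
genre that the user has watched, that is, the number of paths
from the user to the genre through UM and MG. The function name
and the toy data are hypothetical.

```python
def weighted_average_age(ages, um, mg, genre):
    """Average age of the watchers of a genre, weighted by path counts.

    ages  -- list with the age of each user
    um    -- users x movies binary table (list of rows)
    mg    -- movies x genres binary table (list of rows)
    genre -- column index of the genre in mg
    """
    weighted_sum = 0
    total_weight = 0
    for age, watched in zip(ages, um):
        # Number of movies of this genre watched by the user.
        weight = sum(w * movie_genres[genre]
                     for w, movie_genres in zip(watched, mg))
        weighted_sum += age * weight
        total_weight += weight
    if total_weight == 0:
        raise ValueError("no user has watched a movie of this genre")
    return weighted_sum / total_weight


# Toy example: the 20-year-old watched two movies of the genre,
# the 40-year-old watched one.
toy_ages = [20, 40]
toy_um = [[1, 1, 0], [0, 0, 1]]
toy_mg = [[1], [1], [1]]
toy_average = weighted_average_age(toy_ages, toy_um, toy_mg, genre=0)
```

In the toy example the weighted average is (2 * 20 + 1 * 40) / 3,
about 26.7 years, whereas the unweighted mean age of the two
users is 30 years.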

The results of assessing Hyp 5 are given in Table 6. The
empirical p-values of the queries depend largely on the type
of randomization used. When randomizing the ages of the
users, that is, sw(AU) ⋈ UM ⋈ MG, the genres whose average
age of watchers was originally around 34 years are not
significant. This makes sense when compared to the average
age of all users, which is 34.1 years. Notice that in the
query the average is weighted by the number of movies
watched by the user. Thus randomizing the table AU tests the
connection between the ages and the users. Other
randomization points tell us that the results on western,
romance,



                      sw(OU)                  sw(I_2)
Occupation   Orig.    Mean (Std)    p-val.    Mean (Std)    p-val.
None         ?        ?             0.038     0.07 (?)      0.001
Librarian    0.18     ?             0.001     0.04 (?)      0.001
Retired      0.18     ?             0.040     0.05 (0.01)   0.001
Homemaker    0.17     ?             ?         ?             0.226
?            ?        ?             ?         0.08 (?)      0.001
?            0.14     0.09 (0.03)   0.073     0.04 (0.01)   0.001
Educator     0.13     ?             0.001     ?             0.001
Lawyer       0.13     ?             ?         ?             0.001
Salesman     0.12     ?             0.330     0.06 (?)      0.001
Healthcare   0.12     0.09 (0.03)   0.211     0.04 (0.01)   0.001
Student      ?        ?             0.001     ?             0.001
Scientist    0.11     0.07 (?)      0.052     0.05 (0.01)   0.001
Artist       0.10     0.07 (?)      ?         0.04 (?)      0.001
Technician   0.10     0.07 (?)      ?         ?             0.001
Programmer   0.08     0.05 (?)      0.025     0.03 (?)      0.001
Engineer     0.08     0.05 (0.02)   0.034     0.03 (0.01)   0.001
Marketing    0.08     0.07 (0.03)   0.340     0.05 (0.01)   0.006
Writer       0.08     0.06 (0.02)   0.122     0.03 (0.01)   0.001
Executive    0.07     0.07 (0.02)   0.337     0.04 (0.01)   0.001
Administr.   0.05     0.04 (0.02)   0.367     0.02 (0.01)   0.001
Other        0.04     0.04 (0.01)   0.483     0.02 (0.00)   0.002

Table 5: Empirical p-values for Hyp 4. The original values
of Statistic 4 with mean and std over 999 randomized samples
are given. The results on randomizations
OU ⋈ sw(I_1) ⋈ UM ⋈ MG were similar to sw(OU) ⋈ UM ⋈ MG,
whereas the results on OU ⋈ sw(UM) ⋈ MG and
OU ⋈ UM ⋈ sw(MG) were similar to OU ⋈ UM ⋈ sw(I_2) ⋈ MG.
Bold p-values are significant with sw(OU) and nonsignificant
with sw(I_2).

crime and fantasy are not significant, whereas the results
on the other genres are significant. Thus the inner
structure of the User x Movie and Movie x Genre relations
explains the results of our query. The average ages of the
users of the genres marked with a star in Table 6 were
significant with all types of randomizations.



7 Related work 

There is a large amount of statistical literature about
hypothesis testing [3, 9]. For the particular case of data
mining, many papers study the significance of association
rules and other patterns [14, 15]. In recent years, the
framework of randomizations has been introduced to the data
mining community to test the significance of patterns: the
papers (H] H) deal with randomizations on binary data, and
the work in [12] studies randomizations on real-valued data.
For another type of approach to measuring p-values for
patterns, see [16]. Related work that studies permutations
on networks and how this affects the significance of
patterns is [1, 11]. Sub-sampling methods such as
bootstrapping Q use randomization to study the properties of
the underlying distribution instead of testing the data
against some null model. Finally, database theory mainly
studies query processing and optimization in different
complex data [6, 10]. To the best of our knowledge, there is
no work that directly addresses the problem presented in
this paper.

Genre         Orig.   sw(AU)   sw(UM)   sw(I_2)   sw(MG)
Film-noir*    35.8    0.001    0.001    0.003     0.001
Documentary   35.0    0.134    0.001    0.001     0.001
Mystery       34.3    0.197    0.001    0.004     0.001
War           34.2    0.308    0.001    0.004     0.001
Drama         34.1    0.493    0.001    0.001     0.001
Western       33.8    0.307    0.001    0.168     0.060
Romance*      33.4    0.024    0.001    0.039     0.002
Musical       33.0    0.016    0.253    0.469     0.257
Crime         32.6    0.001    0.001    0.181     0.411
Comedy*       32.5    0.001    0.001    0.003     0.007
Thriller*     32.2    0.001    0.001    0.003     0.004
Adventure*    32.0    0.001    0.001    0.001     0.006
Fantasy       32.0    0.002    0.001    0.130     0.164
Children's*   31.8    0.001    0.001    0.002     0.001
Sci-fi*       31.8    0.001    0.001    0.001     0.003
Action*       31.7    0.001    0.001    0.001     0.001
Horror*       31.1    0.001    0.001    0.001     0.001
Animation*    30.9    0.001    0.001    0.004     0.002

Table 6: Empirical p-values for Hyp 5. The results on
randomizations AU ⋈ sw(I_1) ⋈ UM ⋈ MG were similar to
sw(AU) ⋈ UM ⋈ MG. Genres with a star are significant with
all randomizations. Bold p-values are non-significant.

8 Conclusions and future work 

We have addressed the problem of assessing the significance
of queries made for the exploratory analysis of relational
databases. Each query, together with the associated
statistic, defines the hypothesis to test on our data. Our
mathematical tool for deciding significance is
randomization. It turns out that in multi-relational data
there is no unique way to randomize. We propose to randomize
the tables occurring in the queries one at a time and to
obtain a p-value for each randomization. Each p-value tells
what the structural impact of the randomized table is in the
query. For example, if certain structures or patterns remain
after the randomizations, the answers of a query that rely
on such patterns should not be significant. Experiments with
synthetic data showed that for well-defined significant
patterns, randomizations uncover which tables from our
database are key in significance testing. For real datasets,
we tested several hypotheses to show the usability of the
method. Still, we found that in real data it is difficult to
give a fully satisfactory answer about how to use all the
obtained p-values to draw the correct inference. Our
contribution makes an important first step towards
understanding how the structure hidden in the data makes
some hypotheses more significant than others, but a lot of
interesting future work remains to be done, such as the
study of the combinatorial properties of the randomizations
and their connection to the significance of queries and
patterns.